Introduction

We investigate a dataset of white wines and the influence of their chemical properties on the ratings given by wine experts. The full dataset contains 4898 samples with 11 attributes and 1 test result (the median of at least three evaluations on a scale from 0 (bad) to 10 (excellent)). The dataset comes from P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

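There are 11 physicochemical input variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol.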

We add a qualitative variable (wine category) to the dataset, which describes how sweet the wine is based on its residual sugar. Below 4 g/dm3 the wine is dry, between 4 and 12 g/dm3 it is medium dry, between 12 and 45 g/dm3 it is medium sweet, and above 45 g/dm3 it is sweet. The output variable quality is a score between 0 and 10 and is based on sensory data.
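The category rule can be written as a small function. The following is a minimal sketch in Python, given as an illustration and not as the code used for this report. The function name `wine_category` is hypothetical, and the handling of values that fall exactly on a threshold is an assumption, because the text does not say which category the boundaries belong to.

```python
def wine_category(residual_sugar):
    """Map residual sugar in g/dm3 to a sweetness category.

    Assumption: a value exactly on a threshold (4, 12 or 45) is assigned
    to the sweeter category; the report does not specify this.
    """
    if residual_sugar < 4:
        return "dry"
    if residual_sugar < 12:
        return "medium dry"
    if residual_sugar < 45:
        return "medium sweet"
    return "sweet"


# Residual sugar of the first three wines in the dataset structure listing.
for sugar in (20.7, 1.6, 6.9):
    print(sugar, wine_category(sugar))
```

For these three wines the sketch returns medium sweet, dry and medium dry, which matches the category codes 3, 1 and 2 shown for them in the dataset structure.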

Univariate Plots Section

Wine quality appears to be approximately normally distributed, and 93% of all wines have a quality rating between 5 and 7 points.

There are some very large values for the chloride level, more than four times the median value. It will be interesting to investigate the influence of chlorides on the wine quality rating.

Univariate Analysis

What is the structure of your dataset?

There are 4898 observations of 13 variables: 11 input variables (all numerical), one categorical variable (the added wine category) and one output variable (quality):

## 'data.frame':    4898 obs. of  13 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ category            : Factor w/ 4 levels "dry","medium dry",..: 3 1 2 2 2 2 2 3 1 1 ...

The wine categories “dry”, “medium dry” and “medium sweet” have different distributions of the output variable quality. Only one observation in the dataset falls into the wine category “sweet”.

##    
##     dry medium dry medium sweet sweet
##   3   9          8            3     0
##   4  93         58           12     0
##   5 513        623          321     0
##   6 924        896          377     1
##   7 458        318          104     0
##   8  78         73           24     0
##   9   3          2            0     0
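The counts in the table above are enough to check the total number of observations and the share of wines rated 5 to 7. The following Python sketch is an illustration only (it is not the code behind this report); the counts are copied from the table and the variable names are hypothetical.

```python
# Counts copied from the quality-by-category table.
# Tuple order: dry, medium dry, medium sweet, sweet.
counts = {
    3: (9, 8, 3, 0),
    4: (93, 58, 12, 0),
    5: (513, 623, 321, 0),
    6: (924, 896, 377, 1),
    7: (458, 318, 104, 0),
    8: (78, 73, 24, 0),
    9: (3, 2, 0, 0),
}

# Wines per quality score and per wine category.
row_totals = {quality: sum(row) for quality, row in counts.items()}
column_totals = [sum(column) for column in zip(*counts.values())]

total = sum(row_totals.values())
share_5_to_7 = sum(row_totals[q] for q in (5, 6, 7)) / total

print(total)                   # 4898
print(column_totals)           # [2078, 1978, 841, 1]
print(round(share_5_to_7, 3))  # 0.926
```

The share of 0.926 agrees with the roughly 93% stated in the Univariate Plots Section.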

What is/are the main feature(s) of interest in your dataset?

We want to predict the wine quality from the input features. Hence, the main question is which features influence this variable most. Based on intuition, the features volatile acidity, chlorides, total sulfur dioxide and sulphates are important, because the wine quality is expected to be bad if their levels are too high.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The features citric acid and residual sugar are expected to improve the wine quality. The volume percent of alcohol is not expected to influence the quality of the wine.

Did you create any new variables from existing variables in the dataset?

Besides the qualitative variable (wine category, see Introduction), 5 more variables were added to the dataset:

  • free sulfur proportion: ratio of free to total sulfur dioxide
  • volatile acid proportion: ratio of volatile to fixed acidity
  • citric-volatile-ratio: ratio of citric acid to volatile acidity
  • citric-sugar-ratio: ratio of citric acid to residual sugar
  • volatile-sugar-ratio: ratio of volatile acidity to residual sugar
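The five ratios can be computed per wine as sketched below. This is an illustrative Python sketch, not the code used for this report. The keys `free.sulfur.ratio`, `volatile.ratio` and `citric.sugar.ratio` follow the names in the model output later in the report, and mapping them to the definitions above is an assumption; `citric.volatile.ratio` and `volatile.sugar.ratio` are hypothetical names.

```python
def add_ratio_features(wine):
    """Return a copy of one wine record with the five ratio variables added.

    `wine` is a dict keyed by the column names of the dataset.
    Assumption: the key names of the new variables (see lead-in).
    """
    out = dict(wine)
    out["free.sulfur.ratio"] = wine["free.sulfur.dioxide"] / wine["total.sulfur.dioxide"]
    out["volatile.ratio"] = wine["volatile.acidity"] / wine["fixed.acidity"]
    out["citric.volatile.ratio"] = wine["citric.acid"] / wine["volatile.acidity"]
    out["citric.sugar.ratio"] = wine["citric.acid"] / wine["residual.sugar"]
    out["volatile.sugar.ratio"] = wine["volatile.acidity"] / wine["residual.sugar"]
    return out


# First wine in the dataset structure listing.
first_wine = {
    "fixed.acidity": 7.0,
    "volatile.acidity": 0.27,
    "citric.acid": 0.36,
    "residual.sugar": 20.7,
    "free.sulfur.dioxide": 45.0,
    "total.sulfur.dioxide": 170.0,
}
extended = add_ratio_features(first_wine)
print(round(extended["free.sulfur.ratio"], 3))
print(round(extended["citric.volatile.ratio"], 3))
```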

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Several features have a skewed distribution or some outliers on the right side. Hence, to make the analysis more robust, only values below the 99th percentile are considered for the following variables:

  • sulphates
  • density
  • total sulfur dioxide
  • free sulfur dioxide
  • chlorides
  • residual sugar
  • citric acid
  • volatile acidity
  • fixed acidity

After the cleaning, 339 observations are removed from the dataset. It now has 4559 observations remaining.
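The trimming step can be sketched as follows. This Python sketch is an illustration, not the code used for this report. It assumes that all cut-offs are computed on the full data before any row is removed and that the percentile is interpolated linearly between order statistics; the report states neither. The function names are hypothetical and the example uses toy data.

```python
from statistics import quantiles


def percentile_99(values):
    """99th percentile, interpolated linearly between order statistics."""
    return quantiles(values, n=100, method="inclusive")[98]


def drop_upper_tail(rows, columns):
    """Keep rows whose value is below the 99th percentile in every column.

    Assumption: cut-offs come from the unfiltered data, so the order of
    the columns does not matter.
    """
    cutoffs = {col: percentile_99([row[col] for row in rows]) for col in columns}
    return [row for row in rows if all(row[col] < cutoffs[col] for col in columns)]


# Toy data, not the wine dataset: 200 rows with one column holding 1..200.
rows = [{"chlorides": float(i)} for i in range(1, 201)]
kept = drop_upper_tail(rows, ["chlorides"])
print(len(rows) - len(kept))  # 2 rows removed (values 199 and 200)
```

Because a row is dropped as soon as any one of the nine variables exceeds its cut-off, the number of removed rows can be well above 1% of the data, which is consistent with 339 of 4898 observations being removed here.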

Bivariate Plots Section

Correlation Matrix

Surprisingly, alcohol level is positively correlated with wine quality. Chlorides, total sulfur dioxide and volatile-acid-proportion have a negative correlation with wine quality.

Negative Correlation Quality vs. Input Features

Positive Correlation Quality vs. Input Features

Most wines have a quality rating between 5 and 7 points. The median alcohol level for wines with a rating of 5 points is 9.6%, for wines with a rating of 6 points 10.5% and for wines with a rating of 7 points 11.4%.
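Medians per quality rating such as these can be obtained by grouping the wines by rating. The Python sketch below is an illustration only, uses toy values instead of the wine data, and the function name `median_by_group` is hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_by_group(rows, group_key, value_key):
    """Median of `value_key` for each distinct value of `group_key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: median(values) for group, values in sorted(groups.items())}


# Toy records, not taken from the wine dataset.
toy_wines = [
    {"quality": 5, "alcohol": 9.0},
    {"quality": 5, "alcohol": 9.5},
    {"quality": 5, "alcohol": 10.0},
    {"quality": 6, "alcohol": 10.0},
    {"quality": 6, "alcohol": 11.0},
]
print(median_by_group(toy_wines, "quality", "alcohol"))  # {5: 9.5, 6: 10.5}
```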

Negative Correlation Alcohol vs. Other Input Features

Wine quality and alcohol level are positively correlated. If another input feature is negatively correlated with alcohol level, it can have an indirect negative influence on wine quality, even if that feature has no direct linear relationship with quality.

Quality and Wine Category

The median wine quality is the same for all wine categories.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Surprisingly, alcohol has the clearest positive linear correlation with quality. On the other hand, there is only a weak linear relationship between quality and volatile acidity, chlorides and total sulfur dioxide. Additionally, there is a positive correlation between quality and free sulfur proportion as well as between quality and the citric-volatile-ratio, two variables that were introduced during the analysis.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Alcohol has a stronger negative linear correlation with residual sugar (plausible, because sugar is converted into alcohol during fermentation), chlorides, free sulfur dioxide, total sulfur dioxide and density (again plausible, because alcohol has a lower density than water, so a higher alcohol level means a lower density). Hence, a higher alcohol level comes with lower levels of chlorides and sulfur dioxide, and in this combination the quality improves.

What was the strongest relationship you found?

The strongest positive correlation is between volatile acidity and the volatile acid proportion with r = 0.935. The strongest negative correlation is between alcohol and density with r = -0.813. Both correlations are plausible.
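The values of r quoted here are Pearson correlation coefficients. As a reminder of how such a coefficient is computed, here is a minimal Python sketch; it is an illustration with toy numbers and a hypothetical function name, not the code used for this report.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations; the common
    # factor 1/(n - 1) cancels in the ratio, so it is left out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy numbers, not taken from the wine dataset.
print(pearson_r([1, 2, 3], [1, 3, 2]))  # 0.5
```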

Multivariate Plots Section

Wines with a lower quality rating have a higher chloride level and a lower alcohol level.

Wines with a lower quality rating have a higher free sulfur dioxide level and a lower alcohol level.

Alcohol level and free sulfur dioxide level have a negative correlation. When the alcohol level is low, the chloride level is high at all levels of free sulfur dioxide.

There is an area of high wine quality for a total sulfur dioxide level of 80-130 mg/dm3 and a chloride level of 0.02-0.04 g/dm3.

The area of high wine quality is the same for all wine categories.

There is an area of high wine quality for a total sulfur dioxide level of 80-130 mg/dm3 and a free sulfur dioxide level of 30-50 mg/dm3.

There is an area of high wine quality for a volatile acidity level of 0.15-0.3 g/dm3 and a citric-volatile-ratio of 1-2.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The comparatively strong relation between wine quality and alcohol level, which is not intuitive, is based on more complex relationships. Chlorides, free sulfur dioxide and total sulfur dioxide are all negatively correlated with alcohol. Hence, they appear to have a negative influence on wine quality, although the relationship is not linear. For example, if the level of chlorides is too high, the wine quality goes down, no matter what level of free sulfur dioxide is present. Conversely, a low level of free sulfur dioxide does not guarantee a high wine quality, because the chloride level may still be too high. Wines with a high alcohol level tend to have lower levels of chlorides, free sulfur dioxide and total sulfur dioxide and, consequently, a higher quality.

Were there any interesting or surprising interactions between features?

Among wines of high quality, the citric-volatile-ratio (ratio of citric acid to volatile acidity) decreases as volatile acidity increases. Apparently, for a wine of good quality, the amount of citric acid relative to volatile acidity needs to be lower when the level of volatile acidity is higher.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + chlorides, data = wine)
## m3: lm(formula = quality ~ alcohol + chlorides + free.sulfur.ratio, 
##     data = wine)
## m4: lm(formula = quality ~ alcohol + chlorides + free.sulfur.ratio + 
##     volatile.ratio, data = wine)
## m5: lm(formula = quality ~ alcohol + chlorides + free.sulfur.ratio + 
##     volatile.ratio + citric.sugar.ratio, data = wine)
## m6: lm(formula = quality ~ alcohol + chlorides + total.sulfur.dioxide, 
##     data = wine)
## 
## ===========================================================================================
##                            m1         m2         m3         m4          m5         m6      
## -------------------------------------------------------------------------------------------
##   (Intercept)            2.561***   2.978***   2.608***   2.822***    2.682***   2.655***  
##                         (0.101)    (0.133)    (0.134)    (0.134)     (0.133)    (0.160)    
##   alcohol                0.317***   0.294***   0.288***   0.304***    0.334***   0.311***  
##                         (0.010)    (0.011)    (0.010)    (0.010)     (0.011)    (0.011)    
##   chlorides                        -4.137***  -3.592***  -3.027***   -2.085*    -4.337***  
##                                    (0.860)    (0.846)    (0.836)     (0.835)    (0.860)    
##   free.sulfur.ratio                            1.623***   1.420***    1.295***             
##                                               (0.127)    (0.127)     (0.126)               
##   volatile.ratio                                         -8.608***  -10.527***             
##                                                          (0.782)     (0.802)               
##   citric.sugar.ratio                                                 -0.995***             
##                                                                      (0.109)               
##   total.sulfur.dioxide                                                           0.001***  
##                                                                                 (0.000)    
## -------------------------------------------------------------------------------------------
##   R-squared                  0.2        0.2        0.2        0.2        0.3         0.2   
##   adj. R-squared             0.2        0.2        0.2        0.2        0.3         0.2   
##   sigma                      0.8        0.8        0.8        0.8        0.8         0.8   
##   F                       1108.1      568.4      446.6      374.2      321.4       384.3   
##   p                          0.0        0.0        0.0        0.0        0.0         0.0   
##   Log-likelihood         -5369.6    -5358.1    -5278.0    -5218.1    -5176.7     -5351.5   
##   Deviance                2814.7     2800.4     2703.8     2633.6     2586.2      2792.4   
##   AIC                    10745.3    10724.2    10566.1    10448.2    10367.3     10713.0   
##   BIC                    10764.6    10749.9    10598.2    10486.7    10412.3     10745.1   
##   N                       4559       4559       4559       4559       4559        4559     
## ===========================================================================================

The best model, m5 (it has the lowest AIC and BIC of the six models), explains the wine quality with the features alcohol level, chlorides, free sulfur ratio, volatile ratio and citric sugar ratio. All coefficients are statistically significantly different from zero, although the chlorides coefficient is flagged at a lower significance level (*) than the others (***). The basic wine quality (intercept) is 2.7 and is increased by a higher alcohol level and a higher free sulfur ratio. The quality is decreased by a higher chloride level, a higher volatile ratio and a higher citric-sugar-ratio. The coefficients are consistent with the correlation matrix. However, with an R-squared of about 0.3 the model explains only a limited share of the variation in quality, and a linear model is weak at capturing the complex relationships between alcohol level, chlorides and total sulfur dioxide explained above. In model m6, a higher total sulfur dioxide level very slightly increases the wine quality, whereas in the correlation matrix quality and total sulfur dioxide were negatively correlated.
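To make the coefficients of m5 concrete, the fitted equation can be evaluated for a single wine. The Python sketch below is an illustration, not the code used for this report. It uses the rounded coefficients from the table above and assumes that `free.sulfur.ratio`, `volatile.ratio` and `citric.sugar.ratio` are the free-to-total sulfur dioxide, volatile-to-fixed acidity and citric acid-to-residual sugar ratios defined earlier; the function name is hypothetical.

```python
# Coefficients of model m5 as printed in the regression table (rounded).
M5_COEFFICIENTS = {
    "(Intercept)": 2.682,
    "alcohol": 0.334,
    "chlorides": -2.085,
    "free.sulfur.ratio": 1.295,
    "volatile.ratio": -10.527,
    "citric.sugar.ratio": -0.995,
}


def predict_quality_m5(wine):
    """Evaluate the linear equation of m5 for one wine record."""
    prediction = M5_COEFFICIENTS["(Intercept)"]
    for name, coefficient in M5_COEFFICIENTS.items():
        if name != "(Intercept)":
            prediction += coefficient * wine[name]
    return prediction


# First wine in the dataset structure listing. Assumption: the ratios
# follow the definitions given in the list of new variables.
first_wine = {
    "alcohol": 8.8,
    "chlorides": 0.045,
    "free.sulfur.ratio": 45 / 170,
    "volatile.ratio": 0.27 / 7,
    "citric.sugar.ratio": 0.36 / 20.7,
}
print(round(predict_quality_m5(first_wine), 2))  # 5.45
```

The observed quality of this wine is 6, so the equation underestimates it by about half a point, which is within the residual standard error (sigma) of 0.8 reported in the table.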


Final Plots and Summary

For the final plots, two interesting properties of wine that influence the quality are shown. The first two plots show the distribution of the chloride level and the influence of the chloride level on the wine quality. The third plot reveals a relationship between the citric-volatile-ratio (ratio of citric acid to volatile acidity) and volatile acidity and shows an area where many wines have a high quality rating.

Plot One

Description One

This plot shows the distribution of chlorides in all wines, with the median marked by a vertical line. It appears to be a unimodal distribution. However, there are several wines that contain a considerably higher amount of salt, even three times the median. 90% of the chloride levels of all wines are between 0.027 and 0.063 g/dm3.

Plot Two

Description Two

The plot shows the chloride level and total sulfur dioxide of the wines, colored by wine quality. There is a slight positive correlation between chlorides and total sulfur dioxide. The quality of a wine is low if the chloride level or the total sulfur dioxide is too high. On the other hand, there is an area of chloride level and total sulfur dioxide where the wine quality is higher on average (red area in the plot). This seems to be a good combination for a good taste. This area of high wine quality has a total sulfur dioxide level of 80-130 mg/dm3 and a chloride level of 0.02-0.04 g/dm3. However, if both levels are lower, the wine quality is low, too.
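A visual impression like this can be checked numerically by comparing the mean quality inside and outside the described rectangle. The Python sketch below is an illustration with toy records, not the code or data of this report; the function names are hypothetical, and treating the region bounds as inclusive is an assumption.

```python
def in_high_quality_area(wine):
    """True if the wine lies in the rectangle described in the text.

    Assumption: the bounds 80-130 mg/dm3 and 0.02-0.04 g/dm3 are inclusive.
    """
    return (80 <= wine["total.sulfur.dioxide"] <= 130
            and 0.02 <= wine["chlorides"] <= 0.04)


def mean_quality(wines):
    """Mean quality of a list of wine records, or None if the list is empty."""
    if not wines:
        return None
    return sum(wine["quality"] for wine in wines) / len(wines)


def compare_area(wines):
    """Mean quality inside and outside the rectangle."""
    inside = [w for w in wines if in_high_quality_area(w)]
    outside = [w for w in wines if not in_high_quality_area(w)]
    return mean_quality(inside), mean_quality(outside)


# Toy records, not taken from the wine dataset.
toy_wines = [
    {"total.sulfur.dioxide": 100, "chlorides": 0.030, "quality": 7},
    {"total.sulfur.dioxide": 120, "chlorides": 0.035, "quality": 6},
    {"total.sulfur.dioxide": 180, "chlorides": 0.050, "quality": 5},
    {"total.sulfur.dioxide": 60, "chlorides": 0.015, "quality": 5},
]
print(compare_area(toy_wines))  # (6.5, 5.0)
```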

Plot Three

Description Three

The plot shows the relationship between the citric-volatile-ratio (ratio of citric acid to volatile acidity) and volatile acidity, colored by wine quality. There is no linear relationship between the two features; however, a higher level of volatile acidity corresponds to a lower citric-volatile-ratio. Wines of high quality have a medium citric-volatile-ratio, and the favorable ratio decreases with increasing volatile acidity. There is an area of high wine quality for a volatile acidity level of 0.15-0.3 g/dm3 (around the median volatile acidity of 0.26 g/dm3) and a citric-volatile-ratio of 1-2 (around the median citric-volatile-ratio of 1.27).


Reflection

The wine data set contains information about roughly 4500 different white wines and the influence of chemical properties on the ratings given by wine experts. At first, I explored each input variable on its own and created five new variables (ratios) to answer questions about relative amounts. After exploring each variable, I assumed that volatile acidity, chlorides, total sulfur dioxide and sulphates are important for wine quality, because the quality will decrease if their levels are too high. However, surprisingly, there was no high correlation between quality and these four variables. There was only a high correlation between alcohol level and wine quality, a relationship that was unexpected.

The strong relation between wine quality and alcohol level is based on more complex relationships. Chlorides, free sulfur dioxide and total sulfur dioxide are all negatively correlated with alcohol. Hence, they appear to have a negative influence on wine quality, although the relationship is not linear. For example, if the level of chlorides is too high, the wine quality goes down, no matter what level of free sulfur dioxide is present. Conversely, a low level of free sulfur dioxide does not guarantee a high wine quality, because the chloride level may still be too high.

I then tried to fit a linear model to predict the wine quality from the input variables alcohol level, chlorides, free sulfur ratio, volatile ratio and citric sugar ratio. The coefficients of the linear model are consistent with the correlation matrix. However, the linear model is weak at modeling the complex relationships between alcohol level, chlorides and total sulfur dioxide, as explained above. To investigate this data further, I would try to find a better model which can represent the complex relationships between the different input variables, such that the alcohol level is no longer needed to predict the wine quality.

Sources